Fixes test_dns_resolv_conf.py failures due to stale DNS config in some of the containers #26188
purush-nexthop wants to merge 1 commit into sonic-net:master
/azp run Azure.sonic-buildimage
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
e7bb017 to 846257c
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
…alls to update-containers can update containers with stale dns config Signed-off-by: purush-nexthop <[email protected]>
846257c to 58f625c
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
@oleksandrivantsiv @mlok-nokia could you please review?
fi

# Check if networking service is active (only for bulk updates)
networking_status=$(systemctl is-active networking.service 2>/dev/null)
This change will affect the performance of the config reload command, as it will run an additional bulk update for all containers after the networking service is stopped.
Is the issue you are trying to fix a recent degradation?
Please review the following PRs. They fix similar issues:
#25991
sonic-net/sonic-utilities#4365
Thanks for reviewing this.
No, this change is not meant to fix a degradation. During config reload (after removing the DNS config), some of the containers come up with stale config and never get updated, because update-containers was bailing out on the networking check. Let me take a look at the PRs you posted and see if they could help here.
@Aravind-Subbaroyan we are seeing a similar issue in 202511 runs.
Why I did it
test_dns_resolv_conf.py removes the DNS config and checks that /etc/resolv.conf is updated in all containers. The test fails because some of the containers, such as pmon and restapi, still have stale DNS config in their /etc/resolv.conf.
Work item tracking
How I did it
When the DNS config is removed, the following happens:
1. All containers are restarted
config load minigraph is called with an empty nameserver, which results in all containers getting restarted.
2. Updating host /etc/resolv.conf
(2a) hostcfgd (sonic-host-services/scripts/hostcfgd) detects the change and restarts the resolv-config service
(2b) /usr/bin/resolv-config.sh updates /etc/resolv.conf
3. Networking service restarts
(3a) config load minigraph restarts the sonic target, which triggers interfaces-config.service to call interfaces-config.sh
(3b) interfaces-config.sh calls "systemctl restart networking" to restart the networking service
4. Per-container /etc/resolv.conf update
When containers come up, update-containers is called with the container name as its argument (see files/build_templates/docker_image_ctl.j2). When called with a container name, update-containers updates /etc/resolv.conf in that container based on the contents of the host /etc/resolv.conf.
(4a) Containers that come up after the host /etc/resolv.conf is updated have no issues.
(4b) Containers that came up before the host /etc/resolv.conf update will have stale config in them.
5. Bulk update to containers
Any change to /etc/resolv.conf results in running all the scripts under /etc/resolvconf/update-libc.d/. As a result, update-containers should get called without a container name (to update all containers).
During a bulk update, update-containers goes through all active containers and calls update_container_resolv for each one, which updates /etc/resolv.conf inside the container based on values read from the host /etc/resolv.conf.
update-containers exits early if the networking service is not up. This came in as part of:
3a143ad
Optimization to skip updates during warm reboot
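The single-container and bulk paths described above can be sketched roughly as follows. This is a minimal illustration and not the actual update-containers script: the `update_containers` wrapper function and the `DOCKER`, `SYSTEMCTL`, and `HOST_RESOLV_CONF` override variables are assumptions added here so the logic can be exercised without a running SONiC system, and the real script may copy the file differently.

```shell
#!/bin/bash
# Illustrative sketch only; not the real update-containers script.
# DOCKER, SYSTEMCTL and HOST_RESOLV_CONF are hypothetical override hooks.
DOCKER="${DOCKER:-docker}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
HOST_RESOLV_CONF="${HOST_RESOLV_CONF:-/etc/resolv.conf}"

# Overwrite /etc/resolv.conf inside one container with the host copy.
update_container_resolv() {
    local container="$1"
    "$DOCKER" exec -i "$container" sh -c 'cat > /etc/resolv.conf' \
        < "$HOST_RESOLV_CONF"
}

# With a container name: update only that container.
# Without an argument: bulk update of all running containers.
update_containers() {
    local container="$1"
    if [ -n "$container" ]; then
        update_container_resolv "$container"
        return
    fi

    # Early exit described above: skip the bulk update while the
    # networking service is not active.
    local networking_status
    networking_status=$("$SYSTEMCTL" is-active networking.service 2>/dev/null)
    if [ "$networking_status" != "active" ]; then
        echo "networking.service is '$networking_status'; skipping bulk update"
        return 0
    fi

    for container in $("$DOCKER" ps --format '{{.Names}}'); do
        update_container_resolv "$container"
    done
}
```

In this sketch, `update_containers pmon` takes the per-container path from step 4, while `update_containers` with no argument takes the bulk path from step 5, including the early exit.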
Sequence that lands in the problem state:
1. All containers are restarted along with the networking service.
2. pmon/restapi come up early with stale DNS config. All other containers come up after the update to the host /etc/resolv.conf.
3. update-containers is called for a bulk update and exits early because the networking service is down.
4. The networking service comes back up.
5. pmon/restapi continue to have stale config.
The fix here removes the networking check in update-containers.
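The race and the effect of dropping the networking check can be reproduced with a toy simulation. This is only a model of the sequence described above: containers are represented as plain files in a temporary directory, the nameserver value is made up, and every function name in the block is hypothetical.

```shell
#!/bin/bash
# Toy model of the sequence above; containers are plain files, not Docker
# containers. All names and values here are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/containers"
host_conf="$workdir/host_resolv.conf"

# A container copies the host file at the moment it comes up.
start_container() {
    cp "$host_conf" "$workdir/containers/$1"
}

# $1: networking state, $2: "check" (old behavior) or "nocheck" (with the fix)
bulk_update() {
    if [ "$2" = "check" ] && [ "$1" != "active" ]; then
        return 0    # early exit: no container is updated
    fi
    for f in "$workdir"/containers/*; do
        cp "$host_conf" "$f"
    done
}

# Replays the sequence and prints the containers left with stale config.
run_sequence() {
    local stale=""
    rm -f "$workdir"/containers/*
    echo "nameserver 1.1.1.1" > "$host_conf"  # DNS config before removal
    start_container pmon                       # come up early
    start_container restapi
    : > "$host_conf"                           # host file updated, DNS removed
    start_container swss                       # comes up after the update
    bulk_update inactive "$1"                  # networking is still restarting
    for f in "$workdir"/containers/*; do
        cmp -s "$host_conf" "$f" || stale="$stale $(basename "$f")"
    done
    echo "${stale# }"
}

stale_with_check=$(run_sequence check)
stale_without_check=$(run_sequence nocheck)
echo "stale with networking check:    ${stale_with_check:-none}"
echo "stale without networking check: ${stale_without_check:-none}"
```

With the check in place the model leaves pmon and restapi stale; without it the bulk update refreshes every container even though networking is still down.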
How to verify it
Ran the test multiple times on different products and ensured that the test passes consistently.
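For a quick manual check alongside the test, something like the following could list containers whose /etc/resolv.conf differs from the host copy. It is a hypothetical helper, not part of this PR or of test_dns_resolv_conf.py, and it assumes a byte-for-byte comparison of the whole file is an acceptable check; `find_stale_containers`, `DOCKER`, and `HOST_RESOLV_CONF` are names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper; not part of the PR or of test_dns_resolv_conf.py.
# DOCKER and HOST_RESOLV_CONF are override hooks assumed for this sketch.
DOCKER="${DOCKER:-docker}"
HOST_RESOLV_CONF="${HOST_RESOLV_CONF:-/etc/resolv.conf}"

# Print the name of every running container whose /etc/resolv.conf differs
# from the host copy; return non-zero if at least one is found.
find_stale_containers() {
    local rc=0 name
    for name in $("$DOCKER" ps --format '{{.Names}}'); do
        if ! "$DOCKER" exec "$name" cat /etc/resolv.conf 2>/dev/null \
                | cmp -s - "$HOST_RESOLV_CONF"; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `find_stale_containers` on the host after removing the DNS config should print nothing once every container has picked up the new file.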
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)